{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PyTorch Tutorial\n",
    "MILA, November 2017\n",
    "\n",
    "By Sandeep Subramanian"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Torch Autograd, Variables, Define-by-run & Execution Paradigm\n",
    "\n",
    "Adapted from\n",
    "1. http://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py \n",
    "2. http://pytorch.org/docs/master/notes/autograd.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variables : Thin wrappers around tensors to facilitate autograd\n",
    "\n",
    "Supports almost all operations that can be performed on regular tensors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from __future__ import print_function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import torch \n",
    "from torch.autograd import Variable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![caption](images/Variable.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wrap tensors in a Variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Variable containing:\n",
      "-0.2456 -0.0608 -0.7359\n",
      "-0.8375 -0.3687 -0.6179\n",
      "-0.1984 -0.2076 -0.7292\n",
      " 0.4198 -0.3215  0.9470\n",
      "-0.3811 -0.0531  0.9047\n",
      "[torch.FloatTensor of size 5x3]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "z = Variable(torch.Tensor(5, 3).uniform_(-1, 1))\n",
    "print(z)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Properties of Variables : Requiring gradients, Volatility, Data & Grad\n",
    "\n",
    "1. You can access the raw tensor through the .data attribute\n",
    "2. Gradient of the loss w.r.t. this variable is accumulated into .grad.\n",
    "3. Stay tuned for requires_grad and volatile"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requires Gradient : False \n",
      "Volatile : False \n",
      "Gradient : None \n",
      "\n",
      "-0.2456 -0.0608 -0.7359\n",
      "-0.8375 -0.3687 -0.6179\n",
      "-0.1984 -0.2076 -0.7292\n",
      " 0.4198 -0.3215  0.9470\n",
      "-0.3811 -0.0531  0.9047\n",
      "[torch.FloatTensor of size 5x3]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print('Requires Gradient : %s ' % (z.requires_grad))\n",
    "print('Volatile : %s ' % (z.volatile))\n",
    "print('Gradient : %s ' % (z.grad))\n",
    "print(z.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([5, 5])\n"
     ]
    }
   ],
   "source": [
    "### Operations on Variables\n",
    "x = Variable(torch.Tensor(5, 3).uniform_(-1, 1))\n",
    "y = Variable(torch.Tensor(3, 5).uniform_(-1, 1))\n",
    "# matrix multiplication\n",
    "z = torch.mm(x, y)\n",
    "print(z.size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define-by-run Paradigm\n",
    "\n",
    "The torch autograd package provides automatic differentiation for all operations on Tensors.\n",
    "\n",
    "PyTorch's autograd is a reverse mode automatic differentiation system.\n",
    "\n",
    "Backprop is defined by how your code is run, and that every single iteration can be different.\n",
    "\n",
    "Other frameworks that adopt a similar approach :\n",
    "\n",
    "1. Chainer - https://github.com/chainer/chainer\n",
    "2. DyNet - https://github.com/clab/dynet\n",
    "3. Tensorflow Eager - https://research.googleblog.com/2017/10/eager-execution-imperative-define-by.html\n",
    "\n",
    "### How autograd encodes execution history\n",
    "\n",
    "\n",
    "Conceptually, autograd maintains a graph that records all of the operations performed on variables as you execute your operations. This results in a directed acyclic graph whose leaves are the input variables and roots are the output variables. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![caption](images/dynamic_graph.gif)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GIF source: https://github.com/pytorch/pytorch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Internally, autograd represents this graph as a graph of Function objects (really expressions), which can be `apply()` ed to compute the result of evaluating the graph. When computing the forwards pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient (the `.grad_fn` attribute of each Variable is an entry point into this graph). When the forwards pass is completed, we evaluate this graph in the backwards pass to compute the gradients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<torch.autograd.function.AddmmBackward object at 0x7fd21a6d89e0>\n"
     ]
    }
   ],
   "source": [
    "x = Variable(torch.Tensor(5, 3).uniform_(-1, 1))\n",
    "y = Variable(torch.Tensor(3, 5).uniform_(-1, 1))\n",
    "z = torch.mm(x, y)\n",
    "print(z.grad_fn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An important thing to note is that the graph is recreated from scratch at every iteration, and this is exactly what allows for using arbitrary Python control flow statements, that can change the overall shape and size of the graph at every iteration. You don’t have to encode all possible paths before you launch the training - what you run is what you differentiate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting gradients : `backward()` & `torch.autograd.grad`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)\n",
    "y = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)\n",
    "z = x ** 2 + 3 * y\n",
    "z.backward(gradient=torch.ones(5, 3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Variable containing:\n",
       " 1  1  1\n",
       " 1  1  1\n",
       " 1  1  1\n",
       " 1  1  1\n",
       " 1  1  1\n",
       "[torch.ByteTensor of size 5x3]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# eq computes element-wise equality\n",
    "torch.eq(x.grad, 2 * x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Variable containing:\n",
       " 3  3  3\n",
       " 3  3  3\n",
       " 3  3  3\n",
       " 3  3  3\n",
       " 3  3  3\n",
       "[torch.FloatTensor of size 5x3]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.grad"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)\n",
    "y = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)\n",
    "z = x ** 2 + 3 * y\n",
    "dz_dx = torch.autograd.grad(z, x, grad_outputs=torch.ones(5, 3))\n",
    "dz_dy = torch.autograd.grad(z, y, grad_outputs=torch.ones(5, 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define-by-run example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Common Variable definition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)\n",
    "w = Variable(torch.Tensor(3, 10).uniform_(-1, 1), requires_grad=True)\n",
    "b = Variable(torch.Tensor(10,).uniform_(-1, 1), requires_grad=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Graph 1 : `wx + b`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "o = torch.matmul(x, w) + b\n",
    "do_dinputs_1 = torch.autograd.grad(o, [x, w, b], grad_outputs=torch.ones(5, 10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gradients of o w.r.t inputs in Graph 1\n",
      "do/dx : \n",
      "\n",
      " Variable containing:\n",
      " 1.5226 -1.3615  0.7991\n",
      " 1.5226 -1.3615  0.7991\n",
      " 1.5226 -1.3615  0.7991\n",
      " 1.5226 -1.3615  0.7991\n",
      " 1.5226 -1.3615  0.7991\n",
      "[torch.FloatTensor of size 5x3]\n",
      " \n",
      "do/dw : \n",
      "\n",
      " Variable containing:\n",
      " 0.3361  0.3361  0.3361  0.3361  0.3361  0.3361  0.3361  0.3361  0.3361  0.3361\n",
      "-1.1158 -1.1158 -1.1158 -1.1158 -1.1158 -1.1158 -1.1158 -1.1158 -1.1158 -1.1158\n",
      " 1.2694  1.2694  1.2694  1.2694  1.2694  1.2694  1.2694  1.2694  1.2694  1.2694\n",
      "[torch.FloatTensor of size 3x10]\n",
      " \n",
      "do/db : \n",
      "\n",
      " Variable containing:\n",
      " 5\n",
      " 5\n",
      " 5\n",
      " 5\n",
      " 5\n",
      " 5\n",
      " 5\n",
      " 5\n",
      " 5\n",
      " 5\n",
      "[torch.FloatTensor of size 10]\n",
      " \n"
     ]
    }
   ],
   "source": [
    "print('Gradients of o w.r.t inputs in Graph 1')\n",
    "print('do/dx : \\n\\n %s ' % (do_dinputs_1[0]))\n",
    "print('do/dw : \\n\\n %s ' % (do_dinputs_1[1]))\n",
    "print('do/db : \\n\\n %s ' % (do_dinputs_1[2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Graph 2 : wx / b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "o = torch.matmul(x, w) / b\n",
    "do_dinputs_2 = torch.autograd.grad(o, [x, w, b], grad_outputs=torch.ones(5, 10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Gradients of o w.r.t inputs in Graph 2\n",
      "do/dx : \n",
      " Variable containing:\n",
      " 47.6666 -31.6417 -54.9581\n",
      " 47.6666 -31.6417 -54.9581\n",
      " 47.6666 -31.6417 -54.9581\n",
      " 47.6666 -31.6417 -54.9581\n",
      " 47.6666 -31.6417 -54.9581\n",
      "[torch.FloatTensor of size 5x3]\n",
      " \n",
      "do/dw : \n",
      " Variable containing:\n",
      "\n",
      "Columns 0 to 7 \n",
      " 25.7204  -1.4251   0.5816   0.7336   0.3829  -0.5467   0.3904   0.3968\n",
      "-85.3804   4.7306  -1.9306  -2.4353  -1.2709   1.8149  -1.2960  -1.3172\n",
      " 97.1318  -5.3817   2.1963   2.7705   1.4459  -2.0647   1.4743   1.4985\n",
      "\n",
      "Columns 8 to 9 \n",
      " -4.2812  -0.4352\n",
      " 14.2118   1.4446\n",
      "-16.1679  -1.6434\n",
      "[torch.FloatTensor of size 3x10]\n",
      " \n",
      "do/db : \n",
      " Variable containing:\n",
      "-173.4977\n",
      "   5.7707\n",
      "   5.5410\n",
      "  -2.9863\n",
      "  -0.3088\n",
      "  -1.7527\n",
      "  -0.5234\n",
      "  -1.5729\n",
      "-261.3063\n",
      "  -0.8931\n",
      "[torch.FloatTensor of size 10]\n",
      " \n"
     ]
    }
   ],
   "source": [
    "print('Gradients of o w.r.t inputs in Graph 2')\n",
    "print('do/dx : \\n %s ' % (do_dinputs_2[0]))\n",
    "print('do/dw : \\n %s ' % (do_dinputs_2[1]))\n",
    "print('do/db : \\n %s ' % (do_dinputs_2[2]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gradient buffers: `.backward()` and `retain_graph=True`\n",
    "\n",
    "1. Calling `.backward()` clears the current computation graph.\n",
    "2. Once `.backward()` is called, intermediate variables used in the construction of the graph are removed.\n",
    "2. This is used implicitly to let PyTorch know when a new graph is to be built for a new minibatch. This is built around the forward and backward pass paradigm.\n",
    "3. To retain the graph after the backward pass use `loss.backward(retain_graph=True)`. This lets you re-use intermediate variables to potentially compute a secondary loss after the initial gradients are computed. This is useful to implement things like the gradient penalty in WGANs (https://arxiv.org/abs/1704.00028)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "o = torch.mm(x, w) + b\n",
    "o.backward(torch.ones(5, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Call backward again -> <font color='red'>This fails</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "RuntimeError",
     "evalue": "Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mRuntimeError\u001b[0m                              Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-24-ce042d8d2cd4>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mo\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mo\u001b[0m \u001b[0;34m**\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mo\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbackward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mones\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m5\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m10\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m/home/sandeep/anaconda2/lib/python2.7/site-packages/torch/autograd/variable.pyc\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(self, gradient, retain_graph, create_graph, retain_variables)\u001b[0m\n\u001b[1;32m    156\u001b[0m                 \u001b[0mVariable\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    157\u001b[0m         \"\"\"\n\u001b[0;32m--> 158\u001b[0;31m         \u001b[0mtorch\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mautograd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbackward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgradient\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mretain_graph\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcreate_graph\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mretain_variables\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    159\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    160\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mregister_hook\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhook\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/home/sandeep/anaconda2/lib/python2.7/site-packages/torch/autograd/__init__.pyc\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(variables, grad_variables, retain_graph, create_graph, retain_variables)\u001b[0m\n\u001b[1;32m     97\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     98\u001b[0m     Variable._execution_engine.run_backward(\n\u001b[0;32m---> 99\u001b[0;31m         variables, grad_variables, retain_graph)\n\u001b[0m\u001b[1;32m    100\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    101\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/home/sandeep/anaconda2/lib/python2.7/site-packages/torch/autograd/function.pyc\u001b[0m in \u001b[0;36mapply\u001b[0;34m(self, *args)\u001b[0m\n\u001b[1;32m     89\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     90\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 91\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_forward_cls\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbackward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     92\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     93\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/home/sandeep/anaconda2/lib/python2.7/site-packages/torch/autograd/_functions/blas.pyc\u001b[0m in \u001b[0;36mbackward\u001b[0;34m(ctx, grad_output)\u001b[0m\n\u001b[1;32m     33\u001b[0m     \u001b[0;34m@\u001b[0m\u001b[0mstaticmethod\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     34\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mbackward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mctx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mgrad_output\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 35\u001b[0;31m         \u001b[0mmatrix1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmatrix2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mctx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msaved_variables\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     36\u001b[0m         \u001b[0mgrad_add_matrix\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrad_matrix1\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mgrad_matrix2\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     37\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mRuntimeError\u001b[0m: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time."
     ]
    }
   ],
   "source": [
    "o = o ** 3\n",
    "o.backward(torch.ones(5, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  But with `retain_graph=True`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "o = torch.mm(x, w) + b\n",
    "o.backward(torch.ones(5, 10), retain_graph=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "o = o ** 3\n",
    "o.backward(torch.ones(5, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color='red'>WARNING:</font> Calling `.backward()` multiple times will accumulate gradients into `.grad` and NOT overwrite them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Excluding subgraphs from backward: requires_grad=False, volatile=True & .detach\n",
    "\n",
    "### `requires_grad=False`\n",
    "\n",
    "1. If there’s a single input to an operation that requires gradient, its output will also require gradient.\n",
    "\n",
    "2. Conversely, if all inputs don’t require gradient, the output won’t require it.\n",
    "\n",
    "3. Backward computation is never performed in the subgraphs, where all Variables didn’t require gradients.\n",
    "\n",
    "4. This is potentially useful when you have part of a network that is pretrained and not fine-tuned, for example word embeddings or a pretrained imagenet model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "x = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=False)\n",
    "y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=False)\n",
    "z = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " o = x + y requires grad ? : False \n",
      " o = x + y + z requires grad ? : True \n"
     ]
    }
   ],
   "source": [
    "o = x + y\n",
    "print(' o = x + y requires grad ? : %s ' % (o.requires_grad))\n",
    "o = x + y + z\n",
    "print(' o = x + y + z requires grad ? : %s ' % (o.requires_grad))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `volatile=True`\n",
    "\n",
    "1. If a single input to an operation is volatile, the resulting variable will not have a `grad_fn` and so, the result will not be a node in the computation graph.\n",
    "\n",
    "2. Conversely, only if all inputs are not volatile, the output will have a `grad_fn` and be included in the computation graph.\n",
    "\n",
    "3. Volatile is useful when running Variables through your network during inference. Since it is fairly uncommon to go backwards through the network during inference, `.backward()` is rarely invoked. This means graphs are never cleared and hence it is common to run out of memory pretty quickly. Since operations on `volatile` variables are not recorded on the tape and therfore save memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 195,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "x = Variable(torch.Tensor(3, 5).uniform_(-1, 1), volatile=True)\n",
    "y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), volatile=True)\n",
    "z = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 196,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Graph : x + y\n",
      "o.requires_grad : False \n",
      "o.grad_fn : None \n",
      "\n",
      "\n",
      "Graph : x + y + z\n",
      "o.requires_grad : False \n",
      "o.grad_fn : None \n"
     ]
    }
   ],
   "source": [
    "print('Graph : x + y')\n",
    "o = x + y\n",
    "print('o.requires_grad : %s ' % (o.requires_grad))\n",
    "print('o.grad_fn : %s ' % (o.grad_fn))\n",
    "print('\\n\\nGraph : x + y + z')\n",
    "o = x + y + z\n",
    "print('o.requires_grad : %s ' % (o.requires_grad))\n",
    "print('o.grad_fn : %s ' % (o.grad_fn))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `.detach()`\n",
    "\n",
    "1. It is possible to detach variables from the graph by calling `.detach()`. \n",
    "2. This could lead to disconnected graphs. In which case PyTorch will only backpropagate gradients until the point of disconnection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)\n",
    "y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)\n",
    "z = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![caption](images/detach.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "m1 = x + y\n",
    "m2 = z ** 2\n",
    "m1 = m1.detach()\n",
    "m3 = m1 + m2\n",
    "m3.backward(torch.ones(3, 5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dm3/dx \n",
      "\n",
      " None \n",
      "\n",
      "dm3/dy \n",
      "\n",
      " None \n",
      "\n",
      "dm3/dz \n",
      "\n",
      " Variable containing:\n",
      " 1.2043  1.9914  0.1340 -1.8074  1.3064\n",
      "-0.1923  0.9834 -1.9299  1.4948  0.6174\n",
      " 1.0566 -1.1677 -1.5411  0.5598 -1.9467\n",
      "[torch.FloatTensor of size 3x5]\n",
      " \n"
     ]
    }
   ],
   "source": [
    "print('dm3/dx \\n\\n %s ' % (x.grad))\n",
    "print('\\ndm3/dy \\n\\n %s ' % (y.grad))\n",
    "print('\\ndm3/dz \\n\\n %s ' % (z.grad))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Gradients w.r.t intermediate variables in the graph\n",
    "\n",
    "1. By default, PyTorch all gradient computations w.r.t intermediate nodes in the graph are ad-hoc.\n",
    "\n",
    "2. This is in the interest of saving memory.\n",
    "\n",
    "3. To compute gradients w.r.t intermediate variables, use `.retain_grad()` or explicitly compute gradients using `torch.autograd.grad`\n",
    "\n",
    "4. `.retain_grad()` populates the `.grad` attribute of the Variable while `torch.autograd.grad` returns a Variable that contains the gradients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "x = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)\n",
    "y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)\n",
    "z = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "m1 = x + y\n",
    "m2 = z ** 2\n",
    "m1.retain_grad()\n",
    "m2.retain_grad()\n",
    "m3 = m1 * m2\n",
    "m3.backward(torch.ones(3, 5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dm3/dm1 \n",
      "\n",
      " Variable containing:\n",
      " 0.6986  0.0007  0.0314  0.3346  0.4070\n",
      " 0.7087  0.7066  0.0344  0.3643  0.4734\n",
      " 0.6196  0.0079  0.0240  0.4427  0.0531\n",
      "[torch.FloatTensor of size 3x5]\n",
      " \n",
      "dm3/dm2 \n",
      "\n",
      " Variable containing:\n",
      "-1.0104  1.6305  0.3076 -0.2957  0.1597\n",
      "-0.1984 -1.2168  0.4246 -1.3702  0.8474\n",
      "-1.1777  1.6642 -1.2514  0.8266  0.0997\n",
      "[torch.FloatTensor of size 3x5]\n",
      " \n"
     ]
    }
   ],
   "source": [
    "print('dm3/dm1 \\n\\n %s ' % (m1.grad))\n",
    "print('dm3/dm2 \\n\\n %s ' % (m2.grad))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### In place operations on variables in a graph\n",
    "\n",
    "source: http://pytorch.org/docs/master/notes/autograd.html\n",
    "\n",
    "In place operations are suffixed by `_` ex: `log_`, `uniform_` etc.\n",
    "\n",
    "1. Supporting in-place operations in autograd is difficult and PyTorch discourages their use in most cases.\n",
    "\n",
    "2. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.\n",
    "\n",
    "### There are two main reasons that limit the applicability of in-place operations:\n",
    "\n",
    "(a) Overwriting values required to compute gradients. This is why variables don’t support `log_`. Its gradient formula requires the original input, and while it is possible to recreate it by computing the inverse operation, it is numerically unstable, and requires additional work that often defeats the purpose of using these functions.\n",
    "\n",
    "(b) Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place versions simply allocate new objects and keep references to the old graph, while in-place operations, require changing the creator of all inputs to the Function representing this operation. This can be tricky, especially if there are many Variables that reference the same storage (e.g. created by indexing or transposing), and in-place functions will actually raise an error if the storage of modified inputs is referenced by any other Variable.\n",
    "In-place correctness checks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Second and higher order derivatives\n",
    "\n",
    "### Computing gradients w.r.t gradients\n",
    "\n",
    "1. `o = xy + z`\n",
    "2. `l = o + do_dz`\n",
    "\n",
    "### Practical application of this in WGAN-GP later in the tutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)\n",
    "y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)\n",
    "z = Variable(torch.Tensor(5, 5).uniform_(-1, 1), requires_grad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "do/dz \n",
      "\n",
      " : Variable containing:\n",
      "-0.0465  1.6061  0.6523  0.0604 -1.0861\n",
      "-0.1614 -1.8736  1.6718  1.7629 -1.4649\n",
      "-0.7555  0.2532 -1.8296 -1.5360  0.2838\n",
      " 1.1525  0.8089 -1.7133 -1.3501  1.7537\n",
      " 0.6360  1.3759 -1.7214  0.2242  0.5220\n",
      "[torch.FloatTensor of size 5x5]\n",
      " \n",
      "dl/dz \n",
      "\n",
      " : Variable containing:\n",
      "  1.9749   2.1041   2.5755   2.0140   1.6410\n",
      "  1.7873   1.8495   3.9286  10.6421   0.3371\n",
      "  0.2702   3.0303  -6.2525   1.6083   2.0623\n",
      "  2.8351   3.7686  -4.1183   1.6259   2.7881\n",
      "  2.2092   2.0311  -3.4778   2.0041   2.0320\n",
      "[torch.FloatTensor of size 5x5]\n",
      " \n"
     ]
    }
   ],
   "source": [
    "o = torch.mm(x, y) + z ** 2\n",
    "# if create_graph=False then the resulting gradient is volatile and cannot be used further to compute a second loss.\n",
    "do_dz = torch.autograd.grad(o, z, grad_outputs=torch.ones(5, 5), retain_graph=True, create_graph=True)\n",
    "print('do/dz \\n\\n : %s ' % (do_dz[0]))\n",
    "l = o ** 3 + do_dz[0]\n",
    "dl_dz = torch.autograd.grad(l, z, grad_outputs=torch.ones(5, 5))\n",
    "print('dl/dz \\n\\n : %s ' % (dl_dz[0]))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
